Web Crawler with browser_cookie3
Web Crawler with browser_cookie3
β PTT Gossiping Example
This guide demonstrates how to use browser_cookie3
to retrieve cookies from your browser and use them in a web crawler to access the PTT Gossiping board, which requires a over18=1
cookie to bypass the age restriction prompt.
π§ Requirementsβ
Install the required libraries:
pip install requests beautifulsoup4 browser-cookie3
π What is browser_cookie3
?β
browser_cookie3
is a Python library that allows you to programmatically access cookies stored by your browser (Chrome, Firefox, etc.). It can be useful when accessing websites that require session or consent cookies.
π Sample Crawler Codeβ
import requests
import browser_cookie3
from bs4 import BeautifulSoup
def _fetch_cookies_from_browser() -> dict[str, str]:
"""Fetches PTT cookies (like over18=1) from your Chrome browser."""
cookie_jar = browser_cookie3.chrome(domain_name="ptt.cc")
local_cookie_dict = requests.utils.dict_from_cookiejar(cookie_jar)
# Confirming if the essential cookie exists
cookies = {
"over18": local_cookie_dict.get("over18", "1")
}
print("[Fetched Cookies]", cookies)
return cookies
def crawl_ptt_gossiping():
url = "https://www.ptt.cc/bbs/Gossiping/index.html"
cookies = _fetch_cookies_from_browser()
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
response = requests.get(url, headers=headers, cookies=cookies)
if response.status_code != 200:
print(f"Failed to retrieve page, status: {response.status_code}")
return
soup = BeautifulSoup(response.text, "html.parser")
articles = soup.select("div.title a")
print("\nπ₯ Latest Posts on PTT Gossiping:")
for article in articles:
print(f"- {article.text.strip()} ({article['href']})")
if __name__ == "__main__":
crawl_ptt_gossiping()
π‘ Notesβ
- If youβve never visited PTT Gossiping before, make sure to open the page once in Chrome and click "ζεζοΌζε·²εΉ΄ζ»Ώεε
«ζ²" to set the
over18=1
cookie. - You can extend this crawler to scrape post content, authors, dates, or write the data into a CSV/DB.